Handling Data Skew in MapReduce

Authors

  • Benjamin Gufler
  • Nikolaus Augsten
  • Angelika Reiser
  • Alfons Kemper
Abstract

MapReduce systems have become popular for processing large data sets and are increasingly being used in e-science applications. In contrast to simple application scenarios like word count, e-science applications involve complex computations which pose new challenges to MapReduce systems. In particular, (a) the runtime complexity of the reducer task is typically high, and (b) scientific data is often skewed. Together, these lead to highly varying execution times for the reducers. Varying execution times result in low resource utilisation and high overall execution time, since the next MapReduce cycle can only start after all reducers are done. In this paper we address the problem of efficiently processing MapReduce jobs with complex reducer tasks over skewed data. We define a new cost model that takes into account non-linear reducer tasks, and we provide an algorithm to estimate the cost in a distributed environment. We propose two load balancing approaches, fine partitioning and dynamic fragmentation, that are based on our cost model and can deal with both skewed data and complex reduce tasks. Fine partitioning produces a fixed number of data partitions, whereas dynamic fragmentation splits large partitions into smaller portions at runtime and replicates data if necessary. Our approaches can be seamlessly integrated into existing MapReduce systems like Hadoop. We empirically evaluate our solution on both synthetic data and real data from an e-science application.
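The idea behind fine partitioning can be illustrated with a small sketch. It is not the authors' implementation: the quadratic cost function, the greedy largest-first assignment, and the names `partition_costs` and `assign_partitions` are all assumptions made for illustration. The sketch creates more partitions than reducers, estimates each partition's cost with a non-linear function of cluster size, and then distributes partitions so that the estimated reducer loads are roughly even.

```python
import heapq


def partition_costs(key_counts, num_partitions, cost=lambda n: n * n):
    """Estimate the cost of each partition.

    key_counts maps a key to the number of tuples sharing that key (a cluster).
    The quadratic default for `cost` is an assumption standing in for a
    non-linear reducer; it is not taken from the paper.
    """
    costs = [0] * num_partitions
    for key, count in key_counts.items():
        # All tuples of one key must land in the same partition.
        costs[hash(key) % num_partitions] += cost(count)
    return costs


def assign_partitions(costs, num_reducers):
    """Greedily give the most expensive remaining partition to the
    currently least loaded reducer (a common heuristic, assumed here)."""
    heap = [(0, r) for r in range(num_reducers)]
    heapq.heapify(heap)
    assignment = {r: [] for r in range(num_reducers)}
    order = sorted(range(len(costs)), key=lambda p: costs[p], reverse=True)
    for p in order:
        load, r = heapq.heappop(heap)
        assignment[r].append(p)
        heapq.heappush(heap, (load + costs[p], r))
    loads = {r: sum(costs[p] for p in parts) for r, parts in assignment.items()}
    return assignment, loads


if __name__ == "__main__":
    # Hypothetical skewed input: key 0 is far more frequent than the others.
    key_counts = {0: 3, 1: 1, 2: 2, 5: 1}
    costs = partition_costs(key_counts, num_partitions=4)
    assignment, loads = assign_partitions(costs, num_reducers=2)
    print(costs, assignment, loads)
```

Using more partitions than reducers is what gives the assignment step room to balance: with exactly one partition per reducer, the reducer that receives the largest cluster would carry its full cost with nothing to offset it.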

Similar resources

Handling Data Skew in MapReduce Cluster by Using Partition Tuning

The healthcare industry has generated large amounts of data, and analyzing these has emerged as an important problem in recent years. The MapReduce programming model has been successfully used for big data analytics. However, data skew invariably occurs in big data analytics and seriously affects efficiency. To overcome the data skew problem in MapReduce, we have in the past proposed a data pro...

Handling Skew in Multiway Joins in Parallel Processing

Handling skew is one of the major challenges in query processing. In distributed computational environments such as MapReduce, uneven distribution of the data to the servers is not desired. One of the dominant measures that we want to optimize in distributed environments is communication cost. In a MapReduce job this is the amount of data that is transferred from the mappers to the reducers. In...

Fine-Grained Micro-Tasks for MapReduce Skew-Handling

Recent work on MapReduce has considered the problems of skew, where a job’s tasks exhibit large variance in size and processing cost, and stragglers, tasks that run slowly due to conditions on particular nodes. In this paper, we discuss an extremely simple approach to mitigating skew and stragglers: break the workload into many small tasks that are dynamically scheduled at runtime. This approac...

A Survey on Partitioning Skew Diminishing Techniques in Hadoop MapReduce Environment

The era of Big Data produces large volumes of structured and unstructured data. MapReduce is an effective tool for parallel data processing. One significant issue in practical MapReduce applications is data skew: the imbalance in the amount of data assigned to each task. This causes some tasks to take much longer to finish than others and can significantly impact performance. Parallel data p...

Handling Data Skew in Map Reduce Using Hadoop Libra

Many efficient tools use MapReduce applications that assign data to their corresponding tasks in parallel and distributed data processing. LIBRA is a lightweight approach to the data skew problem for applications whose input data is skewed, and it can overlap the map and reduce stages. It is an innovative and accurate distribution method for intermediate data sampling with n...

Handling partitioning skew in MapReduce using LEEN

MapReduce is emerging as a prominent tool for big data processing. Locality is a key feature in MapReduce that is extensively leveraged in data-intensive cloud systems: it avoids network saturation when processing large amounts of data by co-allocating computation and data storage, particularly for the map phase. However, our studies with Hadoop, a widely used MapReduce implementation, demonstrate that the presen...

Publication date: 2011